Deep Multi-task Multi-label CNN for Effective Facial Attribute Classification
Facial Attribute Classification (FAC) has attracted increasing attention in
computer vision and pattern recognition. However, state-of-the-art FAC methods
perform face detection/alignment and FAC independently. The inherent
dependencies between these tasks are not fully exploited. In addition, most
methods predict all facial attributes with the same CNN architecture,
which ignores the different learning complexities of facial attributes. To
address the above problems, we propose a novel deep multi-task multi-label CNN,
termed DMM-CNN, for effective FAC. Specifically, DMM-CNN jointly optimizes two
closely-related tasks (i.e., facial landmark detection and FAC) to improve the
performance of FAC by taking advantage of multi-task learning. To deal with the
diverse learning complexities of facial attributes, we divide the attributes
into two groups: objective attributes and subjective attributes. Two different
network architectures are designed to extract features for the two groups of
attributes, respectively, and a novel dynamic weighting scheme is proposed to
automatically assign a loss weight to each facial attribute during training.
Furthermore, an adaptive thresholding strategy is developed to effectively
alleviate the problem of class imbalance for multi-label learning. Experimental
results on the challenging CelebA and LFWA datasets show the superiority of the
proposed DMM-CNN method over several state-of-the-art FAC methods.
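The dynamic weighting scheme lends itself to a short illustration. Below is a minimal PyTorch sketch of the general idea: per-attribute binary cross-entropy losses are reweighted by a normalized running average, so attributes that are currently harder to learn receive larger weights. The class name, the moving-average update rule, and the momentum value are assumptions for illustration, not the exact scheme of DMM-CNN.

```python
import torch
import torch.nn as nn

class DynamicWeightedAttributeLoss(nn.Module):
    """Hypothetical sketch: reweight per-attribute BCE losses so that
    attributes with higher recent loss get larger weights. The exact
    update rule in DMM-CNN may differ."""

    def __init__(self, num_attrs: int, momentum: float = 0.9):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss(reduction="none")
        self.momentum = momentum
        # running estimate of each attribute's difficulty
        self.register_buffer("running_loss", torch.ones(num_attrs))

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits, targets: (batch, num_attrs), one binary label per attribute
        per_attr = self.bce(logits, targets).mean(dim=0)  # (num_attrs,)
        with torch.no_grad():
            self.running_loss.mul_(self.momentum).add_((1 - self.momentum) * per_attr)
            weights = self.running_loss / self.running_loss.mean()
        return (weights * per_attr).mean()
```

In the multi-task setting the abstract describes, this attribute loss would be summed with a facial landmark regression loss, e.g. `total = attr_loss + lam * landmark_loss`.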
Learning Speech Representation From Contrastive Token-Acoustic Pretraining
For fine-grained generation and recognition tasks such as
minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic
speech recognition (ASR), the intermediate representations extracted from
speech should serve as a "bridge" between text and acoustic information,
containing information from both modalities. The semantic content is
emphasized, while the paralinguistic information such as speaker identity and
acoustic details should be de-emphasized. However, existing methods for
extracting fine-grained intermediate representations from speech suffer from
issues of excessive redundancy and dimension explosion. Contrastive learning is
well suited to modeling intermediate representations across two modalities.
However, existing contrastive learning methods in the audio field focus on
extracting global descriptive information for downstream audio classification
tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these
issues, we propose a method named "Contrastive Token-Acoustic Pretraining
(CTAP)", which uses two encoders to bring phoneme and speech into a joint
multimodal space, learning how to connect phoneme and speech at the frame
level. The CTAP model is trained on 210k speech and phoneme text pairs,
achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method
offers a promising solution for fine-grained generation and recognition
downstream tasks in speech processing.
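A frame-level contrastive objective of the kind described here can be sketched compactly. The following is an assumed PyTorch version in the spirit of CTAP: time-aligned phoneme and speech frame embeddings from the two encoders are treated as positive pairs, all other frames in the batch as negatives, under a symmetric InfoNCE loss. The shapes, the prior frame alignment, and the temperature value are illustrative, not CTAP's exact formulation.

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(phone_emb, speech_emb, temperature=0.07):
    # phone_emb, speech_emb: (num_frames, dim), already time-aligned so that
    # row i of each tensor refers to the same frame (an assumption here)
    phone_emb = F.normalize(phone_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = phone_emb @ speech_emb.t() / temperature  # pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric InfoNCE: phoneme-to-speech and speech-to-phoneme directions
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```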
Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
In recent years, the joint training of speech enhancement front-end and
automatic speech recognition (ASR) back-end has been widely used to improve the
robustness of ASR systems. Traditional joint training methods use only the
enhanced speech as input to the back-end. However, it is difficult for speech
enhancement systems to separate speech directly from the noisy input due to the diverse
types of noise with different intensities. Furthermore, speech distortion and
residual noise are often observed in enhanced speech, and the distortion of
speech and noise is different. Most existing methods focus on fusing enhanced
and noisy features to address this issue. In this paper, we propose a
dual-stream spectrogram refine network to simultaneously refine the speech and
noise and to decouple the noise from the noisy input. Our proposed method
achieves better performance, with a relative character error rate (CER) reduction of 8.6%.
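To make the dual-stream idea concrete, here is a minimal PyTorch sketch, assuming a shared encoder with two masking heads that refine the speech and noise spectrograms separately, with the noise stream supervised by the residual of the mixture. The architecture (a BiGRU) and layer sizes are placeholders, not the paper's network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamRefiner(nn.Module):
    """Hypothetical sketch: one shared encoder, two heads that estimate
    masks for the speech and noise spectrograms from the noisy input."""

    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_bins, hidden, batch_first=True, bidirectional=True)
        self.speech_head = nn.Linear(2 * hidden, n_bins)
        self.noise_head = nn.Linear(2 * hidden, n_bins)

    def forward(self, noisy):  # noisy: (batch, frames, n_bins) magnitudes
        h, _ = self.encoder(noisy)
        speech = torch.sigmoid(self.speech_head(h)) * noisy  # speech stream
        noise = torch.sigmoid(self.noise_head(h)) * noisy    # noise stream
        return speech, noise

def dual_stream_loss(speech, noise, clean, noisy):
    # supervise both streams; the noise target is the mixture residual
    return F.mse_loss(speech, clean) + F.mse_loss(noise, noisy - clean)
```

Estimating the noise stream explicitly, rather than the speech alone, is what exposes to the back-end how speech and noise are distorted differently.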
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
Recently, there has been a growing interest in text-to-speech (TTS) methods
that can be trained with minimal supervision by combining two types of discrete
speech representations and using two sequence-to-sequence tasks to decouple
TTS. To address the challenges associated with high dimensionality and waveform
distortion in discrete representations, we propose Diff-LM-Speech, which maps
semantic embeddings to mel-spectrograms with a diffusion model and introduces
a prompt encoder, built on variational autoencoders and prosody bottlenecks,
to improve prompt representation capability.
Autoregressive language models often suffer from missing and repeated words,
while non-autoregressive frameworks face expression averaging problems due to
duration prediction models. To address these issues, we propose
Tetra-Diff-Speech, which introduces a duration diffusion model to achieve diverse
prosodic expressions. While we expect the information content of semantic
coding to lie between that of text and acoustic coding, existing models
extract semantic codes with substantial redundancy and suffer from
dimensionality explosion. To verify that semantic coding is not necessary, we propose
Tri-Diff-Speech. Experimental results show that our proposed methods outperform
baseline methods. We provide a website with audio samples.
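The semantic-to-mel diffusion stage shared by these models can be illustrated with a standard DDPM-style training step, with the denoiser conditioned on semantic embeddings. This is a generic sketch under assumed interfaces (the `denoiser` signature, the linear beta schedule, the tensor shapes), not the papers' exact model.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, mel, semantic, num_steps=1000):
    # mel: (batch, frames, n_mels); semantic: (batch, frames, dim)
    betas = torch.linspace(1e-4, 0.02, num_steps, device=mel.device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (mel.size(0),), device=mel.device)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    noise = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    # the denoiser predicts the injected noise, conditioned on semantics
    pred = denoiser(noisy_mel, t, semantic)
    return F.mse_loss(pred, noise)
```

A duration diffusion model of the kind Tetra-Diff-Speech describes would apply the same recipe with per-phoneme durations in place of mel frames.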